Lip reading is the process of comprehending speech by interpreting lip movements. It has drawn much attention because of its applications in audio-visual speech recognition, enhancement, and separation. Traditional solutions mainly relied on CNNs such as ResNet to extract spatial information from video frames. However, CNNs struggle to capture temporal correlations, and compensating for this shortfall makes multi-modal systems more computationally expensive and increases latency. RNNs and LSTMs are commonly used to model temporal dynamics, but they too face scaling challenges. In this paper, we propose DeepLip, a unified CNN-RNN-LSTM architecture for end-to-end visual speech recognition. DeepLip couples spatial feature extraction with temporal embeddings, leveraging the strengths of convolutional and recurrent layers to model both local appearance and sequential dynamics. The resulting representations are well suited to alignment-free training with CTC loss, enabling accurate word- and sentence-level recognition. Our experiments on two datasets, the English LRW and Mandarin LRW-1000, show that DeepLip matches or outperforms the current state of the art while being more efficient and cheaper to run.
Introduction
Visual Speech Recognition (VSR), commonly known as lip reading, is the process of understanding spoken language using visual cues—mainly lip, mouth, and surrounding facial movements. With the rise of deep learning, VSR has advanced significantly, proving valuable when audio is unavailable, distorted, or noisy, and in applications such as assistive devices for hearing-impaired individuals, surveillance, silent communication, and multimodal human-computer interaction.
Challenges in Lip Reading
Spatial vs. Temporal Modeling:
CNNs excel at extracting spatial features from video frames but struggle with long-term temporal dependencies.
RNNs and LSTMs handle temporal sequences but depend on rich spatial features.
Alignment Problem: Mapping video frames to text labels often requires complex preprocessing and alignment.
Variability: Differences in lighting, head pose, speaker habits, and phoneme articulation complicate recognition.
High Computational Cost: Processing video sequences is resource-intensive, especially for 3D CNNs and transformer-based models.
DeepLip: A Hybrid VSR System
DeepLip integrates spatial, temporal, and attention-based modules into a single end-to-end trainable framework, addressing the limitations of previous approaches.
Core Components (a sketch of how they fit together follows this list):
3D Spatio-Temporal Embedding Module: Captures motion dynamics across frames without distorting spatial features.
2D CNN / Swin Transformer: Extracts spatial representations of the lips and surrounding face.
1D Convolutional Attention Module: Enhances temporal feature extraction efficiently for real-time or resource-constrained environments.
RNN-LSTM Decoder: Models long-range temporal dependencies in the video sequence.
CTC Loss Function: Provides alignment-free training, mapping visual sequences to text without frame-level annotations.
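To make the data flow concrete, the PyTorch-style sketch below shows one way these components could be wired together. Layer sizes, the module name DeepLipSketch, and all hyperparameters are illustrative assumptions, not the exact DeepLip configuration.

```python
# Minimal sketch of how the listed components could be composed (PyTorch).
# All layer sizes, names, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class DeepLipSketch(nn.Module):
    def __init__(self, vocab_size=500):
        super().__init__()
        # 3D spatio-temporal stem: captures motion across neighbouring frames.
        self.stem3d = nn.Sequential(
            nn.Conv3d(1, 64, kernel_size=(5, 7, 7), stride=(1, 2, 2), padding=(2, 3, 3)),
            nn.BatchNorm3d(64),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 3, 3), stride=(1, 2, 2), padding=(0, 1, 1)),
        )
        # 2D spatial encoder applied per frame (stand-in for a ResNet or Swin backbone).
        self.spatial2d = nn.Sequential(
            nn.Conv2d(64, 256, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),  # one 256-d vector per frame
        )
        # 1D convolutional attention: cheaply gates each time step.
        self.temporal_gate = nn.Sequential(
            nn.Conv1d(256, 256, kernel_size=5, padding=2),
            nn.Sigmoid(),
        )
        # RNN-LSTM decoder for long-range temporal dependencies.
        self.lstm = nn.LSTM(256, 256, num_layers=2, batch_first=True, bidirectional=True)
        # Per-frame log-probabilities for CTC (index 0 reserved for the blank symbol).
        self.classifier = nn.Linear(512, vocab_size + 1)

    def forward(self, x):                      # x: (B, 1, T, H, W) grayscale lip clips
        x = self.stem3d(x)                     # (B, 64, T, H', W')
        b, c, t, h, w = x.shape
        x = x.transpose(1, 2).reshape(b * t, c, h, w)
        x = self.spatial2d(x).reshape(b, t, -1)          # (B, T, 256)
        gate = self.temporal_gate(x.transpose(1, 2))     # (B, 256, T)
        x = x * gate.transpose(1, 2)                     # attention-weighted features
        x, _ = self.lstm(x)                              # (B, T, 512)
        return self.classifier(x).log_softmax(dim=-1)    # (B, T, vocab+1) for CTC
```

Note that the 3D stem uses temporal stride 1, so every input frame still contributes one time step to the CTC output sequence.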
Key Advantages:
Robust to lighting, speaker variability, and head movement.
Supports silent speech recognition and multimodal interfaces.
Modular architecture allows flexibility for different datasets and real-time deployment on edge devices.
Efficient computational design reduces memory usage and latency.
Related Work
Early lip-reading systems relied on handcrafted features (e.g., Discrete Cosine Transform, geometric lip models) with HMMs for sequence modeling. These approaches generalized poorly under real-world conditions.
Deep learning replaced handcrafted pipelines with CNN-based spatial encoders and RNN/LSTM temporal modules, enabling end-to-end learning.
Recent approaches incorporate 3D CNNs, transformers, and attention modules to capture fine-grained spatial-temporal patterns.
Hybrid CNN-RNN-LSTM frameworks remain state-of-the-art for continuous speech recognition and robust generalization.
Proposed Work Highlights
Integration of 2D CNN, 3D CNN, LSTM, and Transformer-based attention for efficient spatio-temporal learning.
Lightweight 1D Convolutional Attention reduces computational cost, suitable for real-time and edge devices.
CTC-based alignment-free decoding allows training without manual frame annotations (a training-step sketch follows this list).
Tested on benchmark datasets: LRW, GRID, LRW-1000, showing robust performance under varying conditions.
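As referenced above, the snippet below sketches an alignment-free CTC training step using torch.nn.CTCLoss and the DeepLipSketch module from the earlier sketch. Batch size, clip length, label length, and vocabulary size are illustrative assumptions.

```python
# Illustrative CTC training step (PyTorch); shapes and values are assumptions.
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0, zero_infinity=True)

model = DeepLipSketch(vocab_size=500)          # sketch module defined earlier
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)

clips = torch.randn(4, 1, 29, 88, 88)          # (batch, channels, frames, H, W)
targets = torch.randint(1, 501, (4, 10))       # label indices (0 is the CTC blank)
input_lengths = torch.full((4,), 29, dtype=torch.long)   # frames per clip
target_lengths = torch.full((4,), 10, dtype=torch.long)  # labels per clip

log_probs = model(clips)                       # (B, T, vocab+1)
# nn.CTCLoss expects (T, B, C) log-probabilities.
loss = ctc(log_probs.transpose(0, 1), targets, input_lengths, target_lengths)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because CTC marginalizes over all monotonic alignments, only per-clip frame counts and label lengths are needed, never frame-level annotations.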
Technical Details
2D Convolution: Captures spatial features such as edges, contours, and textures of the lip region.
3D Convolution: Encodes motion across successive frames, capturing both appearance and temporal dynamics.
Temporal Modeling: LSTM units track long-range dependencies in lip sequences (see the shape walkthrough after this list).
Attention Mechanisms: Transformer modules and 1D Convolutional Attention enhance temporal reasoning and efficiency.
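The short walkthrough below illustrates the tensor-shape conventions behind these building blocks in PyTorch; the sizes and layer widths are arbitrary assumptions chosen for readability.

```python
# Shape walkthrough for the building blocks above (PyTorch); sizes are illustrative.
import torch
import torch.nn as nn

clip = torch.randn(2, 1, 29, 88, 88)           # (batch, channel, frames, H, W)

# 2D convolution sees one frame at a time: fold time into the batch dimension.
frames = clip.transpose(1, 2).reshape(-1, 1, 88, 88)         # (2*29, 1, 88, 88)
feat2d = nn.Conv2d(1, 32, kernel_size=3, padding=1)(frames)  # spatial edges/contours
print(feat2d.shape)                            # torch.Size([58, 32, 88, 88])

# 3D convolution also slides over time, encoding motion between frames.
feat3d = nn.Conv3d(1, 32, kernel_size=(3, 3, 3), padding=1)(clip)
print(feat3d.shape)                            # torch.Size([2, 32, 29, 88, 88])

# LSTM consumes one feature vector per frame and tracks long-range dependencies.
per_frame = feat3d.mean(dim=(3, 4)).transpose(1, 2)           # (2, 29, 32)
outputs, _ = nn.LSTM(32, 64, batch_first=True)(per_frame)     # (2, 29, 64)
print(outputs.shape)
```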
Conclusion
In this study, we introduced DeepLip, a hybrid end-to-end visual speech recognition framework that combines Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Long Short-Term Memory (LSTM) units to capture both the spatial and temporal aspects of lip movements. By using CNNs to extract strong spatial features and RNN-LSTM layers to capture sequential relationships, DeepLip learns the rich spatio-temporal characteristics required for accurate lip reading. We also incorporated temporal embeddings into the architecture so that subtle motion changes across video frames are easier to capture. Connectionist Temporal Classification (CTC) loss allows the model to align input sequences with their transcriptions without frame-level labels, making training more flexible and scalable.
We conducted extensive experiments on benchmark datasets, including the English LRW and Mandarin LRW-1000 corpora. The results indicate that DeepLip matches or outperforms state-of-the-art models while reducing computational demands. The model generalizes across languages and performs well on both word- and phrase-level recognition tasks. Because the architecture is strong yet lightweight, it is suitable for real-time use and for resource-constrained settings such as mobile or embedded devices. DeepLip is modular, so it can operate on its own or alongside a range of speech recognition backends; this makes it straightforward to add audio streams, apply knowledge distillation (KD) methods, and build multi-modal systems that combine auditory and visual cues to improve speech understanding. In future work, we plan to explore these integrations in more depth to build a fast and accurate audio-visual speech recognition pipeline, and to investigate attention-based methods and other transformer variants to further improve temporal modeling and scalability. DeepLip is a step toward visual speech recognition systems that work across languages, run with low latency, and are practical in real-world deployments.